Seminar 3 (Data embedding)

The goal of this seminar is to play around with diffrent techniques for data visualization. We are going work on the well-known MNIST. The dataset consists of 60,000 grayscale images of hand-written digits of size 28 $ \times $ 28. Thus, the dimensionality of the input space is 784.


In [ ]:
%matplotlib inline
from time import time
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.offsetbox import AnnotationBbox, OffsetImage

Visualizing MNIST

As usual we provide the code that fetches the data:


In [ ]:
import os
from sklearn.datasets import fetch_mldata

# Fetch MNIST dataset and create a local copy.
if os.path.exists('mnist.npz'):
    with np.load('mnist.npz', 'r') as data:
        X = data['X']
        y = data['y']
else:
    mnist = fetch_mldata("MNIST original")
    X, y = mnist.data / 255.0, mnist.target
    np.savez('mnist.npz', X=X, y=y)

Now it's your turn to plot some random representatives from each of 10 (obviously) available classes:


In [ ]:
# Preferably arrange images as a 10x10 matrix.

The whole dataset is somewhat large so we restrict ourselves to the random subset of 5,000 images (corresponding indices are held in train_indices):


In [ ]:
n_train_samples = 5000

indices = np.arange(X.shape[0])
np.random.shuffle(indices)

train_indices = indices[: n_train_samples]

PCA, Multidimensional scaling (MDS) and Locally Linear Embedding (LLE)

Your task is to try three different dimensionality reduction techniques which may be helpful for the visualization of high-dimensional data. The first one should be familiar to you: it's PCA with 2 components. The other two are non-linear embedding approaches described in the lecture, namely, MDS and LLE. Plot all three embeddings and check if they are good enough for the dataset.

NOTE: MDS (default settings) may take some time (>10 min) to compute, so be patient. It's probably a good idea to leave the exploration of this method as a hometask.


In [ ]:
from sklearn.decomposition import TruncatedSVD 
from sklearn.manifold import MDS, LocallyLinearEmbedding

# Your code goes here.

Isomap

Now we are going to explore the manifold of 2's using Isomap. First, you need to compute the embedding of the corresponding subset of MNIST.


In [ ]:
from sklearn.manifold import Isomap

indices_of_2 = np.arange(X.shape[0])[y == 2]
np.random.shuffle(indices_of_2)
train_indices_of_2 = indices_of_2[: n_train_samples]

# Your code goes here.

After it is done we can track how an appearance of the digit changes along the line. One can take two most distant points as the endpoints of the interpolation segment. The following code should extract closest points to the line. Use scipy.spatial.KDTree for the fast nearest neighbour computation.


In [ ]:
from scipy.spatial import KDTree
from scipy.spatial.distance import pdist, cdist, squareform
from scipy import linspace

def find_representatives(kdtree, from_point, to_point, n_points):
    # Given two 2D points this function should return a sequence of the dataset representatives (indices) that
    # we encounter nearby as we go from from_point to to_point. This can be done by taking a set points on 
    # the segment and finding corresponding nearest neighbours in the dataset.
    
    # Your code goes here.
    
    return representatives

n_points = 100

# Your code starts here and shoudld end with:
# representatives = find_representatives(kdtree, X_embedded[from_idx, :], X_embedded[to_idx, :], n_points)

Now we define a bunch of helper functions for interpolation visualization. Note the diplay_manifold_flythrough function. First two arguments are array of images and their coordinates on the 2D plane.


In [2]:
from IPython.display import HTML
from matplotlib import animation

VIDEO_TAG = """<video controls>
 <source src="data:video/x-m4v;base64,{0}" type="video/mp4">
 Your browser does not support the video tag.
</video>"""

def anim_to_html(anim):
    if not hasattr(anim, '_encoded_video'):
        with open('temp.mp4', 'wb') as f:
            anim.save(f.name, fps=20, extra_args=['-vcodec', 'libx264'], writer='ffmpeg')
            video = open('temp.mp4', "rb").read()
        anim._encoded_video = video.encode("base64")
    
    return VIDEO_TAG.format(anim._encoded_video)

def display_animation(anim):
    plt.close(anim._fig)
    return HTML(anim_to_html(anim))

def diplay_manifold_flythrough(X, coords, fig, ax):
    imagebox = OffsetImage(X[0].reshape(28, 28), cmap=plt.cm.gray_r)
    annbox = AnnotationBbox(imagebox, coords[0])
    ax.add_artist(annbox)

    def init():
        return imagebox, annbox

    def animate(i):
        imagebox.set_data(X[i].reshape(28, 28))
        annbox.xyann = coords[i]

        return imagebox, annbox

    anim = animation.FuncAnimation(fig, animate, init_func=init, frames=X.shape[0], interval=20, blit=True)

    return display_animation(anim)

Now use diplay_manifold_flythrough to display the manifold fly-through over the scatter plot of the subset:


In [ ]:
fig = plt.figure(figsize=(10, 10))
ax = plt.axes(frameon=False)
plt.setp(ax, xticks=(), yticks=())

# Your code starts here and should end with:
# diplay_manifold_flythrough(..., ..., fig, ax)

Additionally create an animation of the interpolation. Specifically, you should obtain an animation similar to the one presented below:


In [4]:
HTML("""<video controls>
<source src='./interpolation.mp4' type='video/mp4'>
Your browser does not support the video tag.
</video>""")


Out[4]:

In [ ]:
def display_interpolation(X, steps, fig):
    # NOTE: First argument corresponds to the sequence of images.
    
    n_images = X.shape[0]
    
    im = plt.imshow(X[0].reshape(28, 28), cmap=plt.cm.gray_r)

    def init():
        return im,
    
    def animate(i):
        # Your code goes here.
        
        im.set_array(img)
        return im,

    anim = animation.FuncAnimation(fig, animate, init_func=init, frames=steps, interval=20, blit=True)

    return display_animation(anim)


fig = plt.figure(figsize=(3, 3))
ax = plt.axes(frameon=False)
plt.setp(ax, xticks=(), yticks=())

# Your code starts here and should end with:
# display_interpolation(.., 500, fig)

t-Distributed Stochastic Neighbor Embedding (t-SNE)

If we need to visualize the whole set of digits, a much better embedding technique is t-SNE. This method has been proven to generate high-quality visualizations in various scenarios and has become a standard de-facto in analyzing high-dimensional data (like activities of neurons in an Artificial Neural Network). We are going to use a slower version of the algorithm supplied in the scikit-learn (sklearn.manifold.TSNE). In practice, however, I'd suggest to use an approximate Barnes-Hut t-SNE which produces comparable results while working substantially faster.

Your task here is similar to the ones in the previous section. The differeces are:

  • You should use all digits (a train_indices subset)
  • The fly-through should capture all classes. One way to achieve this would be to use picker capabilities of matplotlib (demo) in order to manually pick a sequence of keypoints on a scatter plot. Given a list of keypoints, it's easy to obtain intermediate representatives using the find_representatives function. The rest of the code should be almost identical to the Isomap-case. Don't forget about the interpolation animation :)

NOTE 1: t-SNE is available in the latest version of scikit-learn. There is a python package called tsne (pip install tsne) which implements Barnes-Hut t-SNE. Feel free to use that one.

NOTE 2: It may be a good idea to apply PCA (with, for example, 50 components) to the data first.


In [ ]:
from sklearn.manifold import TSNE

# Your code goes here.